
Data in, data out

Some terms​

Term | Description
Document | Top-level / root object that is serialized into JSON and stored on ES (has an ID)
Index | Logical namespace of a group of shards
Indexed | Stored and made searchable

Metadata​

  • Required fields:
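    • index: where the document lives (comparable to a database)
    • type: the class of object the document represents (something like a schema)
    • id: unique string to identify a document
  • Together, these 3 fields uniquely identify a document
  • Other metadata fields are covered in a later chapter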
info

Data is stored and indexed in shards

Specifying own ID​

  • PUT verb (store this document AT this URL) e.g. PUT /{index}/{type}/{id}

Autogenerating ID​

  • POST verb (store this document under this URL) e.g. POST /{index}/{type}
  • Autogenerated ID:
    • 22 characters long
    • URL safe
    • Base64 encoded string UUIDs
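
The 22-character length follows from Base64-encoding a 128-bit UUID and dropping the padding. The Python sketch below only illustrates that arithmetic; it is not ES's actual ID generator, and generate_doc_id is a hypothetical helper name.

```python
import base64
import uuid


def generate_doc_id() -> str:
    """Return a 22-character, URL-safe, Base64-encoded random UUID (illustrative only)."""
    raw = uuid.uuid4().bytes  # 16 random bytes = 128 bits
    # URL-safe Base64 of 16 bytes is 24 characters ending in "=="; drop the padding.
    return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")


doc_id = generate_doc_id()
print(doc_id, len(doc_id))
```

The URL-safe alphabet replaces "+" and "/" with "-" and "_", so the ID can be placed in a path such as /{index}/{type}/{id} without escaping.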

Retrieving data​

Entire document​

  • HTTP GET, e.g. GET /{index}/{type}/{id}

Checking if document exists​

  • Use HEAD instead of GET (no body is returned; status 200 if the document exists, 404 if not)

Deleting a document​

  • HTTP DELETE, e.g. DELETE /{index}/{type}/{id}

Sample response if document found

{
"found" : true,
"_index" : "website",
"_type" : "blog",
"_id" : "123",
"_version" : 3
}

Sample response if document not found

{
"found" : false,
"_index" : "website",
"_type" : "blog",
"_id" : "123",
"_version" : 4
}
note

Version is incremented even if not found => internal bookkeeping for ensuring changes are applied in correct order across multiple nodes

Updating a document​

  • Documents are immutable
  • Reindex / replace document when updating
    • Version number updated
  • Internally (done in a single API call)
    • Retrieve old document
    • Change
    • Delete old document
    • Index new document
  • Adopts a last-write-wins approach by default
  • Uses optimistic concurrency control if the version parameter is specified
info

Internally, ES marks the old document as deleted and adds an entirely new document (the old copy is eventually cleaned up as more data is indexed)
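
A toy in-memory model can make the "mark as deleted, then add a new document" behaviour concrete. This is a minimal sketch, not how ES stores data on disk; ToyShard, StoredDoc and merge are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class StoredDoc:
    doc_id: str
    version: int
    source: dict
    deleted: bool = False  # deletion marker; the source itself is never edited


@dataclass
class ToyShard:
    """Append-only toy store: an update adds a new copy instead of editing the old one."""
    segments: list = field(default_factory=list)

    def _live(self, doc_id):
        # Return the newest copy of the document that is not marked as deleted.
        for doc in reversed(self.segments):
            if doc.doc_id == doc_id and not doc.deleted:
                return doc
        return None

    def index(self, doc_id, source):
        old = self._live(doc_id)
        version = 1
        if old is not None:
            old.deleted = True            # mark the old copy as deleted
            version = old.version + 1     # version number is incremented
        self.segments.append(StoredDoc(doc_id, version, dict(source)))
        return version

    def merge(self):
        # Stand-in for later cleanup: physically drop copies marked as deleted.
        self.segments = [d for d in self.segments if not d.deleted]


shard = ToyShard()
shard.index("123", {"title": "first"})
shard.index("123", {"title": "second"})
print(len(shard.segments))  # both copies are still stored until cleanup
shard.merge()
print(len(shard.segments))  # only the live copy remains
```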

Dealing with conflicts​

Approaches to deal with concurrent updates to ensure that no data is lost

Pessimistic concurrency control​

  • Assumes conflicts are likely to happen
  • Blocks access to a resource in order to prevent conflicts
  • e.g. locking a row before reading data

Optimistic concurrency control (used by ES)

  • Assumes conflicts are unlikely
  • Doesn't block operations
  • Underlying data modified between reading and writing => update fails
Handling failure

It's up to the application to handle the failure

  • Reattempt update
  • Report failure to user
    • ...

Example

  • Create document (version 1)
  • Update document PUT /website/blog/1?version=1 => succeeds, version updated to 2
  • Update document again PUT /website/blog/1?version=1 => error (version conflict, current version is now 2)
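
The three steps above can be sketched with a toy in-memory index. This is a simplified model of the version check under the assumption that a write is accepted only when the supplied version equals the current one; ToyIndex and VersionConflict are hypothetical names, not part of any ES client.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


class ToyIndex:
    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def put(self, doc_id, source, version=None):
        current_version = self._docs.get(doc_id, (0, None))[0]
        # With no version given, the write always wins (last-write-wins).
        if version is not None and version != current_version:
            raise VersionConflict(
                f"expected version {version}, current is {current_version}"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, dict(source))
        return new_version


index = ToyIndex()
index.put("1", {"title": "My first blog entry"})        # creates version 1
index.put("1", {"title": "Edited"}, version=1)          # accepted, now version 2
try:
    index.put("1", {"title": "Stale edit"}, version=1)  # version 1 is stale
except VersionConflict as err:
    print("conflict:", err)
```

The caller decides what to do in the except branch: re-read the document and retry, or report the failure.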

Using versions from external system​

  • Common setup: use some other DB as primary data source, ES to make data searchable
  • Can use version number of main DB with ES (e.g. timestamp)
  • Handling by ES is a bit different => ES checks that the current _version is less than the specified version (rather than equal to it)
PUT /website/blog/2?version=5&version_type=external
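
The external-version rule can be sketched as a pure function. This is a minimal model of the comparison only; accepts_external_version and apply_external_write are hypothetical helpers, and the dict stands in for the versions ES would track.

```python
def accepts_external_version(current_version, requested_version):
    """Return True if a write using an external version number would be accepted."""
    if current_version is None:  # document does not exist yet
        return True
    return requested_version > current_version  # must be strictly greater


def apply_external_write(versions, doc_id, requested_version):
    """Record the write if accepted; `versions` maps doc_id -> stored _version."""
    if not accepts_external_version(versions.get(doc_id), requested_version):
        return False
    versions[doc_id] = requested_version  # the external number becomes the new _version
    return True


versions = {}
print(apply_external_write(versions, "2", 5))   # new document: accepted
print(apply_external_write(versions, "2", 10))  # 10 > 5: accepted
print(apply_external_write(versions, "2", 10))  # 10 is not greater than 10: rejected
```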

Partial Updates​

  • Retrieve-change-reindex process as well
  • Happens within a shard => avoid network overhead of multiple requests => reduce likelihood of conflicting changes

Using scripts​

  • Actually, don't really get the benefit as well
  • Default scripting language: Groovy (runs in a sandbox to prevent malicious users from breaking out of ES and attacking the server)
  • Don't really get this as well

Upsert​

  • Updating a nonexistent document will fail
  • Specify upsert parameter to create document if it doesn't exist

POST /website/pageviews/1/_update

{
"script" : "ctx._source.views+=1",
"upsert": {
"views": 1
}
}
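
The effect of the request above can be mimicked on a plain dict. This is a sketch under the assumption that the upsert document is stored as-is when the document is missing and the script only runs when it already exists; increment_views is a hypothetical helper.

```python
def increment_views(store, doc_id, upsert):
    """Mimic `ctx._source.views += 1` with an upsert fallback on a plain dict."""
    if doc_id not in store:
        # Document missing: store the upsert document; the script is not run.
        store[doc_id] = dict(upsert)
    else:
        # Document exists: apply the script's change.
        store[doc_id]["views"] += 1
    return store[doc_id]


pageviews = {}
print(increment_views(pageviews, "1", {"views": 1}))  # first call creates the document
print(increment_views(pageviews, "1", {"views": 1}))  # second call increments it
```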

Updates and conflicts​

  • Smaller window between retrieve / reindex => smaller opportunity for conflicting changes
  • But doesn't mean zero chance
  • When it doesn't matter that the document changed in between (e.g. incrementing a counter), the update can simply be reattempted

POST /website/pageviews/1/_update?retry_on_conflict=5

Retry this update five times before failing

{
"script" : "ctx._source.views+=1",
"upsert": {
"views": 0
}
}
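
The retry behaviour can be sketched as a small loop. This assumes "retry five times" means one initial attempt plus up to five retries; update_with_retries and VersionConflict are hypothetical names, and the flaky function simulates other writers getting in first.

```python
class VersionConflict(Exception):
    """Raised when the document changed between retrieve and reindex."""


def update_with_retries(attempt_update, retry_on_conflict=5):
    """Call attempt_update(), retrying on VersionConflict.

    Makes at most 1 + retry_on_conflict attempts, then re-raises the conflict.
    """
    for attempt in range(retry_on_conflict + 1):
        try:
            return attempt_update()
        except VersionConflict:
            if attempt == retry_on_conflict:
                raise  # out of retries: surface the failure to the caller


calls = {"n": 0}


def flaky_increment():
    calls["n"] += 1
    if calls["n"] <= 2:  # pretend two other writers got in first
        raise VersionConflict()
    return "updated"


print(update_with_retries(flaky_increment))  # succeeds on the third attempt
```

Retrying blindly is only safe when the order of changes doesn't matter, as with a counter increment.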

Retrieving Multiple Documents​

  • mget API (GET /_mget) => retrieves multiple documents in one request, avoiding the network overhead of separate requests
  • Expects a docs array of required metadata
  • Response is successful even if there are missing documents
  • Need to rely on found flag
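
A sketch of building the docs array and then separating found from missing documents. The sample response is hand-written for illustration, not captured from a real cluster; build_mget_body and split_found are hypothetical helpers.

```python
import json


def build_mget_body(refs):
    """Build an mget request body from (index, type, id) tuples."""
    docs = [{"_index": i, "_type": t, "_id": d} for i, t, d in refs]
    return json.dumps({"docs": docs})


def split_found(response):
    """Separate found documents from missing IDs using the per-document found flag."""
    found = [d for d in response["docs"] if d.get("found")]
    missing = [d["_id"] for d in response["docs"] if not d.get("found")]
    return found, missing


body = build_mget_body([("website", "blog", "2"), ("website", "pageviews", "1")])
print(body)

# Hand-written sample response: the request as a whole succeeds,
# but one of the two documents is missing.
sample_response = {
    "docs": [
        {"_index": "website", "_type": "blog", "_id": "2",
         "found": True, "_source": {"title": "My first external blog entry"}},
        {"_index": "website", "_type": "pageviews", "_id": "1", "found": False},
    ]
}
found, missing = split_found(sample_response)
print(len(found), missing)
```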

bulk API​

  • Allows multiple create, index, update, delete requests in a single request
{ action: { metadata }}\n
{ request body }\n
{ action: { metadata }}\n
{ request body }\n
...
  • Every line (including the last line) must end with \n for efficient line separation
  • Cannot contain unescaped newline characters (must not be pretty printed, will interfere with parsing)
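
A sketch of assembling a bulk body that satisfies both rules. It relies on json.dumps producing one line per object when no indent is given, and on newlines inside strings being escaped. build_bulk_body is a hypothetical helper; a delete action is assumed to have no request body line.

```python
import json


def build_bulk_body(operations):
    """Build a newline-delimited bulk body.

    operations: iterable of (action, metadata, source_or_None) tuples.
    """
    lines = []
    for action, metadata, source in operations:
        lines.append(json.dumps({action: metadata}))  # action/metadata line
        if source is not None:                        # delete has no body line
            lines.append(json.dumps(source))          # compact, never pretty printed
    # Every line, including the last one, ends with a newline.
    return "".join(line + "\n" for line in lines)


body = build_bulk_body([
    ("index", {"_index": "website", "_type": "blog", "_id": "1"},
     {"title": "first line\nsecond line"}),
    ("delete", {"_index": "website", "_type": "blog", "_id": "2"}, None),
])
print(body, end="")
```

The newline inside the title is emitted as the two characters `\n` within the JSON string, so it does not break the line-based format.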

DRY​

  • bulk API accepts a default _index or _index/_type in the URL, so it doesn't have to be repeated in every metadata line

How big is too big​

  • Entire bulk request needs to be loaded into memory => request too big => less memory available for other requests
info

Excerpt from book: Fortunately, it is easy to find this sweet spot: Try indexing typical documents in batches of increasing size. When performance starts to drop off, your batch size is too big.

A good place to start is with batches of 1,000 to 5,000 documents or, if your documents are very large, with even smaller batches.

It is often useful to keep an eye on the physical size of your bulk requests. One thousand 1KB documents is very different from one thousand 1MB documents.

A good bulk size to start playing with is around 5-15MB in size
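
One way to respect both limits while experimenting is to cap each batch by document count and by serialized size. This is a minimal sketch; batch_documents is a hypothetical helper, and the default limits are just starting points taken from the ranges quoted above, not tuned values.

```python
import json


def batch_documents(docs, max_docs=1000, max_bytes=5 * 1024 * 1024):
    """Yield lists of documents capped by count and by approximate JSON size."""
    batch, batch_bytes = [], 0
    for doc in docs:
        size = len(json.dumps(doc).encode("utf-8"))
        # Close the current batch if adding this document would exceed a limit.
        if batch and (len(batch) >= max_docs or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)  # an oversized document still goes out, alone in its batch
        batch_bytes += size
    if batch:
        yield batch


docs = [{"n": i} for i in range(2500)]
print([len(b) for b in batch_documents(docs, max_docs=1000)])
```

To look for the sweet spot, time the indexing of each batch while increasing max_docs or max_bytes, and stop when throughput starts to drop.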